Introduction and Course Agenda



• “Big Data” is a popular term for the exponential growth and
availability of data, both structured and unstructured

• In 2001, industry analyst Doug Laney characterized big data in
terms of the “three Vs”:

o Volume 
o Velocity 
o Variety 
Volume

Data volume has increased sharply, for several reasons:

• Transactional data that has accumulated over the years

• Streaming data from social media

• Growing volumes of machine-to-machine data

Storage was initially a major concern, but as storage costs have
dropped, the bigger challenge has become analyzing the data.

Data analytics is therefore the main topic of this course.
Velocity

Data streams in at high speed and must be dealt with in a
timely manner. Examples include:

• Social media

• Mobile devices

The biggest challenge is reacting quickly enough to the
massive amount of data that is flowing in.
Variety

Data comes in many different types:

• Email

• Structured data

• Numeric data

• Application data

• Audio and video

• Financial transactions

• Unstructured documents

Managing all of these formats is a challenge many
organizations face.
4 V’s

o Volume (size)
o Velocity (rapidly streaming)
o Variety (many forms)
o Veracity (uncertainty of data)

Veracity refers to the trustworthiness of the data. With many forms
of big data, quality and accuracy are less controllable (just think
of Twitter posts with hashtags).

5 V’s

o Value: a fifth V that is sometimes added. Access to data is of
little use unless the data delivers business value.
Module 1: Introduction to Big Data Analytics

Upon completion of this module, you should be able to: 

• Define big data 

• Identify four business drivers for advanced analytics 

• Distinguish the techniques for Business Intelligence from those of 
Data Science 

• Describe the role of the Data Scientist within the new big data 
ecosystem 

• Cite at least three illustrative examples of big data opportunities 
Module 1: Introduction to Big Data Analytics


Big Data Overview 

During this part the following topics are covered: 

• Definition of big data 

• Big data characteristics and considerations 

• Unstructured data supporting big data analytics 

• Analyst perspective on Data Repositories (the evolution of 
data repositories) 
Introduction to Big Data Analytics


What is Big Data?


What makes data "Big" Data?
Big Data Definition

• "Big Data" is data whose scale, distribution, diversity, and/or 
timeliness require the use of new technical architectures and 
analytics to enable insights that unlock new sources of 
business value. 

► Requires new data architectures, analytic sandboxes 

► New tools and technologies to store, manage, and realize the business
benefit of these large data sets

► New analytical methods 

► Integrating multiple skills into new role of data scientist 

• Organizations are deriving business benefit from analyzing ever
larger and more complex data sets that increasingly require
real-time or near-real-time capabilities

• Big Data is not just a scientific term; it has business value.


Source: McKinsey, May 2011 article, "Big Data: The next frontier for innovation, competition, and productivity"
Key Characteristics of Big Data

1. Data Volume

► 44x increase from 2009 to 2020
(0.8 zettabytes to 35.2 zettabytes)

► Very high, accelerating rate of growth

2. Processing Complexity

► Changing data structures

► Use cases require additional transformations and different analytical
techniques

► The preferred approach for processing big data is parallel computing
environments and Massively Parallel Processing (MPP), which enable
simultaneous, parallel loading and analysis of data.

3. Data Structure

► Greater variety of data structures to mine and analyze

► Most big data is unstructured or semi-structured in nature, which
requires different techniques and tools to process and analyze.
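
The parallel processing idea under "Processing Complexity" can be sketched as split, process, combine. The example below is a minimal illustration only: the function names are hypothetical, and threads inside one Python process stand in for the separate nodes of a real MPP system, so it shows the pattern rather than any performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition):
    """Work done independently on one partition (one 'node')."""
    return sum(partition), len(partition)


def parallel_mean(values, n_partitions=4):
    # Split the data into roughly equal partitions.
    partitions = [values[i::n_partitions] for i in range(n_partitions)]
    # Hand every partition to its own worker so they are processed concurrently.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
mean = parallel_mean(data)
print(mean)
```

Each worker only ever sees its own partition, and only small partial results (a sum and a count) are brought together at the end, which is what lets this style of processing scale out across machines.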


Big Data Size: The Volume of Data Continues to Explode

The Digital Universe 2009 - 2020

• 2009: 0.8 zettabytes

• 2010: 1.2 zettabytes
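
The 44x figure quoted for 2009 to 2020 can be checked with a little arithmetic. The sketch below also derives an implied yearly growth rate, assuming steady compound growth between the two endpoints; that assumption belongs to this example and is not a claim made by the slides.

```python
# Endpoints quoted on the slide, in zettabytes.
start_zb, end_zb = 0.8, 35.2
start_year, end_year = 2009, 2020

# Overall growth multiple between the two endpoints.
multiple = end_zb / start_zb

# Implied compound annual growth rate (assumes steady growth; illustrative only).
years = end_year - start_year
cagr = multiple ** (1 / years) - 1

print(f"growth multiple: {multiple:.0f}x")
print(f"implied annual growth: {cagr:.1%}")
```

Under that assumption the data volume grows by roughly 41% every year, which is why the growth is described as accelerating in absolute terms.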
Big Data Characteristics: Data Structures


Data Growth is Increasingly Unstructured

(The four types below are listed from more structured to less structured.)


Structured

• Data containing a defined data type, format, structure

• Example: Transaction data and OLAP


Semi-Structured

• Textual data files with a discernible pattern,
enabling parsing

• Example: XML data files that are self-describing
and defined by an XML schema


“Quasi” Structured

• Textual data with unorganized data
formats, can be formatted with effort,
tools, and time

• Example: Web clickstream data that
may contain some inconsistencies in data
values and formats


Unstructured

• Data that has no inherent
structure and is usually stored
as different types of files.

• Example: Text documents,
PDFs, images and video
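
To make the four categories concrete, the sketch below handles one small sample of each kind using only Python's standard library. All sample records, field names, and the log-line layout are invented for illustration; the point is how much parsing effort each kind of data demands.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns and types, so a generic parser is enough.
structured = "fiscal_year,sites_thousands\n1969,1.2\n1970,1.9\n"
rows = list(csv.DictReader(io.StringIO(structured)))
sites_1970 = float(rows[1]["sites_thousands"])

# Semi-structured: self-describing tags reveal the layout of each record.
semi = "<order id='42'><item sku='A1' qty='3'/><item sku='B7' qty='1'/></order>"
order = ET.fromstring(semi)
total_qty = sum(int(item.get("qty")) for item in order.findall("item"))

# Quasi-structured: a pattern exists but is inconsistent, so parsing takes
# effort and some lines are lost.
clicks = [
    "2011-08-01 10:02:11 GET /search?q=big+data",
    "08/01/2011 10:02:15 get /products",  # different date format and casing
    "corrupted line",
]
pattern = re.compile(r"^(\S+) (\S+) (GET|POST) (\S+)$", re.IGNORECASE)
parsed = [m.groups() for line in clicks if (m := pattern.match(line))]

# Unstructured: no inherent fields, so the text is treated as raw words.
poem = "so much depends upon a red wheel barrow"
word_count = len(poem.split())

print(sites_1970, total_qty, len(parsed), word_count)
```

Note that the quasi-structured lines parse only after a hand-written pattern is supplied, and even then the two date formats still need to be reconciled afterwards.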
Four Main Types of Data Structures


Structured Data 


SUMMER FOOD SERVICE PROGRAM 1]
(Data as of August 01, 2011)

Fiscal   Number of     Peak (July)     Meals        Total Federal
Year     Sites         Participation   Served       Expenditures 2]
         (Thousands)   (Thousands)     (Millions)   (Million $)

1969       1.2             99             2.2           0.3
1970       1.9            227             8.2           1.8
1971       3.2            569            29.0           8.2
1972       6.5          1,080            73.5          21.9
1973      11.2          1,437            65.4          26.6
1974      10.6          1,403            63.6          33.6
1975      12.0          1,785            84.3          50.3
1976      16.0          2,453           104.8          73.4
TQ 3]     22.4          3,455           198.0          88.9
1977      23.7          2,791           170.4         114.4
1978      22.4          2,333           120.3         100.3
1979      23.0          2,126           121.8         108.6
1980      21.6          1,922           108.2         110.1


Semi-Structured Data


View -> Source

(Screenshot of the browser's View -> Source output for the EMC home page; long lines are cut off at the edge of the screenshot.)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-trans:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<META name="y_key" content="859b4020elc9acec">
<link rel="canonical" href="http://www.emc.com/index.htm" />
<META NAME="verify-v1" CONTENT="yiZt9VOP4eVOjFdIPeWIfRP32g4qtwFEOI2UvTMfSU
<title>EMC - Data Recovery, Cloud Computing, and Storage Hardware</title>
<META NAME="description" CONTENT="EMC is a leading provider of storage hardware solutions tfc
data recovery and improve cloud computing." />
<META NAME="keywords" CONTENT="emc, network storage, data recovery, information manager
software, nas storage, information protection, information management" />
<!-- Start: stylesheet includes -->
<link rel="stylesheet" href="/_admin/css/styles.css" />
<link rel="stylesheet" href="/_admin/css/styles_nav.css" />
<!--[if IE]>

Quasi-Structured Data 



Unstructured Data


The Red Wheelbarrow, by
William Carlos Williams

so much depends 
upon 

a red wheel 
barrow 

glazed with rain 
water 

beside the white 
chickens. 
Data Repositories, An Analyst Perspective


Data Islands (“Spreadmarts”)

• Isolated data marts

• Spreadsheets and low-volume DBs for record keeping

• Analyst dependent on data extracts


Data Warehouses

• Centralized data containers in a purpose-built space

• Supports BI and reporting, but restricts robust analyses or
data exploration

• Analyst dependent on IT & DBAs for data access and
schema changes

• Analysts must spend significant time to get extracts from
multiple sources


Analytic Sandbox

• Data assets gathered from multiple sources and technologies
for analysis

• Enables high performance analytics using in-db processing

• Reduces costs associated with data replication into "shadow"
file systems

• "Analyst-owned" rather than "DBA-owned"

• More robust analyses
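
The difference between working from data extracts and in-db processing can be sketched as follows. SQLite is used here only as a convenient stand-in for an analytic database, and the `loans` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0)],
)

# Extract-based approach: copy every row out, then aggregate locally.
extracted = conn.execute("SELECT branch, amount FROM loans").fetchall()
local_totals = {}
for branch, amount in extracted:
    local_totals[branch] = local_totals.get(branch, 0.0) + amount

# In-database approach: the engine aggregates and only the summary leaves.
in_db_totals = dict(
    conn.execute("SELECT branch, SUM(amount) FROM loans GROUP BY branch")
)
conn.close()

print(local_totals, in_db_totals)
```

Both routes give the same totals, but the extract route moves every row to the analyst's machine first, while the in-database route moves only one row per branch.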
Introduction to Big Data Analytics: Mini-Case Study


Yoyodyne Bank Scenario 

• Evolving from small community bank to a global bank 

• Needs to move away from its legacy mainframes to an environment that
supports more robust analytics

• Growing through mergers and acquisitions 

• Subject to many new regulatory requirements 

• Increasing customer base and increased product offerings 

Your Thoughts? 

Discussion Questions 

1. Discuss how the bank's data would change under these circumstances. 

2. How are their needs changing with these business changes? 

3. What do you need to consider from an analyst point of view? What are 
some things to consider implementing as the bank grows? 
Module 1: Introduction to Big Data Analytics


Summary 

During this part the following topics were covered: 

• Definition of big data 

• Big data characteristics and considerations 

• Unstructured data fueling big data analytics 

• Analyst perspective on Data Repositories 
Module 1: Introduction to Big Data Analytics


State of the Practice in Analytics 


During this part the following topics are covered: 

• Business drivers for analytics 

• Current analytical architecture 

• Business intelligence vs. data science 

• Drivers of big data and new big data ecosystem 
Business Drivers


Business drivers: people, knowledge, and conditions that initiate and
support activities for which the business was designed.


Current business problems provide opportunities for organizations to
become more analytical and data driven.
Business Drivers for Analytics


Here are four examples of common business problems that organizations contend
with today, where they have an opportunity to use advanced analytics to create
competitive advantage rather than doing only standard reporting on these areas.


Driver                                             Examples

1. Desire to optimize business operations          Sales, pricing, profitability, efficiency
   and derive more value from these
   typical tasks

2. Desire to identify business risk in order       Customer churn, fraud, default
   to reduce it

3. Predict new business opportunities              Upsell, cross-sell, best new customer
                                                   prospects

4. Obey laws or regulatory requirements            Anti-Money Laundering, Fair Lending
Analytical Approaches for Meeting Business Drivers

Business Intelligence vs. Data Science

(Figure: Business Intelligence and Predictive Analytics & Data Mining (Data Science) plotted by business value, low to high, against time, past to future.)


Predictive Analytics & Data Mining (Data Science)

Typical Techniques & Data Types

• Optimization, predictive modeling, statistical analysis, machine learning
techniques such as Naive Bayes or regression

• Structured/unstructured data, many types of sources, very large data sets

Common Questions (open-ended)

• What if ...?

• What’s the optimal scenario for our business?

• What will happen next? What if these trends continue? Why is this happening?


Business Intelligence

Typical Techniques & Data Types

• Standard and ad hoc reporting, dashboards that provide KPIs, alerts,
queries, details on demand

• Structured data, traditional sources, manageable data sets

Common Questions

• What happened last quarter?

• How many did we sell?

• Where is the problem? In which situations?
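
The contrast between the two kinds of question can be sketched on a tiny data set. The sales figures below are hypothetical, and the straight-line fit is just one simple stand-in for the predictive techniques listed above.

```python
# Hypothetical quarterly unit sales (illustrative numbers only).
sales = [120, 135, 150, 160, 178, 190]

# Business Intelligence question: "How many did we sell?" (describes the past)
total_sold = sum(sales)
last_quarter = sales[-1]

# Data Science question: "What will happen next?" (predicts the future)
# Ordinary least-squares line fitted by hand: y = intercept + slope * x
n = len(sales)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
next_quarter_forecast = intercept + slope * n

print(total_sold, last_quarter, round(next_quarter_forecast, 1))
```

The first two values are plain aggregations of what already happened; the forecast depends on a model and its assumptions, which is where the extra value and the extra uncertainty both come from.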
A Typical Analytical Architecture
Implications of Typical Architecture for Data Science


• High-value data analytics is hard to reach and
leverage

• Predictive analytics & data mining activities are last in
line for data

► Queued after prioritized operational processes

• Data is moving in batches from EDW to local
analytical tools

► In-memory analytics (such as R, SAS, SPSS, Excel)

► Sampling can skew model accuracy

• Isolated, ad hoc analytic projects, rather than
centrally managed analytics

► Frequently not aligned with corporate business goals


Result: slow “time-to-insight” and reduced business impact
Opportunities for a New Approach to Analytics

New Applications Driving Data Volume

The Big Data trend is generating an enormous amount of information that requires
advanced analytics and new market players to take advantage of it.

• 1990s (RDBMS & data warehouse): measured in terabytes (1 TB = 1,000 GB)

• 2000s (content & digital asset management): measured in petabytes (1 PB = 1,000 TB)

• 2010s (NoSQL & key/value): will be measured in exabytes (1 EB = 1,000 PB)
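
The unit ladder above moves in steps of 1,000. The small helper below is a hypothetical illustration of converting between those decimal units; it adds the zettabyte used earlier in the deck as one more step.

```python
# Decimal storage units: each step up is a factor of 1,000.
UNITS = ["GB", "TB", "PB", "EB", "ZB"]


def convert(value, from_unit, to_unit):
    """Convert between decimal storage units (1 TB = 1,000 GB, and so on)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps


one_eb_in_gb = convert(1, "EB", "GB")
digital_universe_2020_in_eb = convert(35.2, "ZB", "EB")

print(one_eb_in_gb, digital_universe_2020_in_eb)
```

Moving from the terabyte scale of the 1990s to the exabyte scale is a factor of one million, which is the jump in scale that motivates new architectures rather than bigger versions of the old ones.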
Opportunities for a New Approach to Analytics (Continued)

Big Data Ecosystem


Key Concepts: 

A) Significant opportunities exist to extract value from Big Data

B) Entities are emerging throughout the new Big Data ecosystem to
capitalize on these opportunities:

1. Data Devices

2. Data Collectors

3. Data Aggregators

4. Data Users / Buyers

C) To accomplish this, these players will need to adopt new analytic
architectures and methods
Considerations for Big Data Analytics


Criteria for Big Data Projects


New Analytic Architecture 


Analytic Sandbox 

Data assets gathered from multiple sources 
and technologies for analysis 



Copyright © 2014 EMC Corporation. All Rights Reserved. 







State of the Practice in Analytics: Mini-Case Study

Big Data Enabled Loan Processing at YoyoDyne

Underwriting Risk

(Figure: traditional data leveraged vs. big data leveraged)
Module 1: Introduction to Big Data Analytics

Summary 

During this part the following topics were covered: 

• Business drivers for analytics 

• Current analytical architecture 

• Business intelligence vs. data science 

• Drivers of big data and new big data ecosystem 
Thanks